Learning Objectives

  • Explore how we can represent different kinds of data
  • Communicate with visualizations
  • Articulate when visualizations are misleading

Remember

Be patient, curious, and ask questions. You can do this!

Why Visualization?

Visualization is a good way to explore data because people are very good at spotting patterns visually.

  • We need to become aware of patterns in the data.
  • Sometimes these patterns are desirable:
    • Associations

http://www.ph.ucla.edu/epi/snow/mapsbroadstreet.html

Undesirable Variation

  • Sometimes they are not desirable:
    • Experimental Artefacts

Why are there spikes in the data here?

Systolic Blood Pressure Graph http://care.diabetesjournals.org/content/30/8/1959

How we measure blood pressure

Any thoughts after looking at this?

Sphygmomanometer

https://commons.wikimedia.org/w/index.php?curid=7050866

Lots of Ways to Represent Data Visually

This is a non-exhaustive list of visualizations. We won’t talk about all of these, as some of them are very specialized.

This graphic is mostly to show you that there are lots of ways to visualize data, and which one you use is dependent on what parts of the data you want to emphasize.

What a graph needs

  • Title (what it represents)
  • Labeled Axes (with units, preferably)
  • Variables and how they are represented
  • Scales
    • Is it a count or is it a proportion?
  • How many variables and their types
    • Including ones used to dictate colors, shapes, patterns, sizes, opacities, etc.
    • Independent (“predictor”) variables (e.g. time) are usually on the X (horizontal) axis
      • Occasionally time is plotted on the vertical axis for specific reasons
    • Dependent (“outcome” / “response”) variables are usually on the Y (vertical) axis

The Data (NHANES)

We’re going to look at data from a study called NHANES (National Health and Nutrition Examination Survey).

This survey is meant to represent the United States (via a complicated sampling design) in terms of nutritional habits, physical activity, drug use, and outcomes (depression, diabetes, and physical activity).

What does the data actually look like?

Here are the first 10 rows of the data table. Each row of the data corresponds to a patient. Each column is called a variable (if you look at the top of the table, you can get extra information about what data type that variable is).

## # A tibble: 10 x 5
##      BMI Diabetes Gender   Age BMIstatus
##    <dbl> <fct>    <fct>  <int> <fct>    
##  1  32.2 No       male      34 High     
##  2  32.2 No       male      34 High     
##  3  32.2 No       male      34 High     
##  4  30.6 No       female    49 High     
##  5  27.2 No       female    45 Low      
##  6  27.2 No       female    45 Low      
##  7  27.2 No       female    45 Low      
##  8  23.7 No       male      66 Low      
##  9  23.7 No       male      58 Low      
## 10  26.0 No       male      54 Low

Summarizing Data

Oftentimes, we’ll have so many rows of data (sometimes even 1 million rows!) that it’s impossible to look at it all in a table.

For this reason, we often try to generate summaries of the data to give us an idea of what is in the data.

Let’s take a look at two types of summaries of the data. The first is a visual summary of the data.

Each line of this summary corresponds to a row in the dataset. You can see whether each row is complete (whether it has entries for each variable), or whether data is missing, which is represented as grey lines, and shows up as NA in the data.

Understanding why the data is missing is a really important part of being a data scientist. We’ll look more into missing values in our next session.

The second is a more tabular summary where we can get more information about the different variables.

summary(bmi_diabetes)
##       BMI        Diabetes       Gender          Age        BMIstatus  
##  Min.   :15.02   No  :6350   female:3615   Min.   :21.00   High:2544  
##  1st Qu.:24.20   Yes : 742   male  :3479   1st Qu.:33.00   Low :4487  
##  Median :27.81   NA's:   2                 Median :47.00   NA's:  63  
##  Mean   :28.84                             Mean   :47.71              
##  3rd Qu.:32.23                             3rd Qu.:60.00              
##  Max.   :81.25                             Max.   :80.00              
##  NA's   :63
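The `NA's` counts in the summary above can also be computed directly. Here is a minimal sketch in base R; the small data frame `toy` is a hypothetical stand-in, not the real NHANES table.

```r
# Hypothetical stand-in data; NA marks a missing value.
toy <- data.frame(
  BMI      = c(32.2, NA, 27.2, 23.7),
  Diabetes = factor(c("No", "No", NA, "Yes")),
  Age      = c(34L, 49L, 45L, 66L)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))

# TRUE for rows that have an entry for every variable.
complete_rows <- complete.cases(toy)

print(na_per_column)
print(sum(complete_rows))
```

Counting missing values per column is a quick first check before asking why they are missing.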

Categorical (or factor) Data

Categorical data has a limited number of possible values. Some examples include: your home state (50 possible values), country of birth, what kind of phone you have (iOS/Android).

Let’s look at Diabetes in our patients

One tool we often use to explore the data is a table, which counts how many observations fall into each category. Which group is larger?

table(bmi_diabetes$Diabetes)
## 
##   No  Yes 
## 6350  742

Here’s another way to look at the data, a bar plot. You can instantly see which of the groups is larger.

bmi_diabetes %>%
  tidyr::drop_na() %>%
  ggplot(aes(x = Diabetes)) +
  geom_bar() +
  ggtitle("Summary of Patients: Diabetes Status")

Sometimes it is easier to look at a table, but usually a bar chart will make things more obvious.
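If you want to try the table and the bar chart side by side without the tutorial's data loaded, here is a rough sketch in base R. It rebuilds the Diabetes column from the counts printed above (leaving out the missing values) and uses `barplot()` as a stand-in for the ggplot2 call; that substitution is just to keep the example dependency-free.

```r
# Rebuild the Diabetes column from the counts printed above
# (missing values are left out of this sketch).
diabetes <- factor(rep(c("No", "Yes"), times = c(6350, 742)))

counts <- table(diabetes)       # how many patients in each category
shares <- prop.table(counts)    # the same information as proportions

print(counts)
print(round(shares, 3))

# Base R bar plot of the counts.
barplot(counts,
        main = "Summary of Patients: Diabetes Status",
        xlab = "Diabetes", ylab = "Number of patients")
```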

Stacked Barplots

We can also stack values on top of each other to make a stacked barplot. This can be useful when comparing categories.

Look at the scale

What does the y-axis represent in each of these graphs?

Stacked Barplots vs Pie Charts

Pie charts are OK for variables that have just two or three categories, but it becomes difficult to compare values across two pie charts.

Take a look at the bar chart and pie chart. Which group (male or female) has the larger proportion of Diabetes patients? (Note that the scale has changed!)

Continuous Data

Now, let’s look at how continuous (numeric) data can be represented.

Histograms

A histogram shows the distribution of a continuous variable by splitting it into bins and counting how many observations fall into each bin (left).

You can think of a histogram as a bar graph where the x variable is numeric.

Definition edited from Mikhail Popov

Histogram: Play with the Bins

To make a histogram, we have to bin the data, or convert the continuous data into ranges.

Try adjusting the number of bins for the histogram. What details get lost as you adjust the number of bins downwards?
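The same experiment can be sketched in base R. The values below are simulated (not the NHANES data), and `hist()` treats a single number for `breaks` as a suggestion, so the actual number of bins may differ a little from what you ask for.

```r
set.seed(1)  # make the simulated values reproducible
# Simulated stand-in for a continuous variable such as BMI (not real data).
values <- rnorm(500, mean = 28, sd = 6)

# Count observations per bin for a coarse and a fine binning.
coarse <- hist(values, breaks = 5,  plot = FALSE)
fine   <- hist(values, breaks = 40, plot = FALSE)

print(length(coarse$counts))  # few wide bins: overall shape only
print(length(fine$counts))    # many narrow bins: more local detail

# Draw both side by side to compare.
old_par <- par(mfrow = c(1, 2))
plot(coarse, main = "About 5 bins",  xlab = "Value")
plot(fine,   main = "About 40 bins", xlab = "Value")
par(old_par)
```

Every observation lands in exactly one bin either way; only the amount of detail you can see changes.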

All the different ways to represent a continuous variable

Try out all the different ways you can represent a continuous variable in a distribution. Which one do you like?

Something weird about age

Let’s look at our other continuous covariate, Age. Does the dotplot show anything weird to you?

Relationships: 2 categorical variables

Now we’re going to delve into ways of visualizing the relationships between two categorical variables.

Association

##          Diabetes
## BMIstatus  No Yes
##      High  58  14
##      Low  113  12

Proportional versus Frequency

Change the barplot type from regular to proportional. What changes? (look at the scale before/after).

Which barplot type helps you answer the question: “Is the proportion of people with Diabetes larger among those with high BMI than among those with low BMI?”

Which barplot helps you understand how many people had Diabetes with low BMI?
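To make the difference between frequency and proportion concrete, here is a small sketch in base R using the counts from the BMIstatus-by-Diabetes table in the Association section. The plotting choices (`barplot()` on a transposed matrix) are just one way to do it, not necessarily how the tutorial's own plots are built.

```r
# Counts copied from the BMIstatus-by-Diabetes table shown earlier.
counts <- matrix(c(58, 14,
                   113, 12),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(BMIstatus = c("High", "Low"),
                                 Diabetes  = c("No", "Yes")))

# Frequencies answer "how many?"; row proportions answer "what share?".
row_props <- prop.table(counts, margin = 1)
print(counts)
print(round(row_props, 3))

# Stacked bars: counts on the left, proportions on the right.
old_par <- par(mfrow = c(1, 2))
barplot(t(counts), main = "Frequency",
        ylab = "Number of patients", legend.text = TRUE)
barplot(t(row_props), main = "Proportion",
        ylab = "Share of patients")
par(old_par)
```

In this table, 14 of 72 people with High BMI have Diabetes (about 19%), compared with 12 of 125 people with Low BMI (about 10%), even though the two counts are almost the same.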

Relationships: 1 categorical, 1 continuous

Now we’re going to investigate the relationship between one variable that is continuous and another variable that is categorical.

Quick Review of Mathematical Terms

Make sure you understand the following terms before you go on. Here’s a quick review in case you need it.

  • percentile
  • median
  • mean

Boxplots

When you are trying to understand the relationship between a continuous variable and a categorical variable, you want to use a boxplot.

A boxplot allows you to visually compare the distributions by way of a five-number summary, which includes:

  • Sample minimum (the smallest value)
  • First quartile (Q1), which is the 25th percentile
  • Second quartile (Q2), also known as the median. By definition, this is the middle of the data: 50% of the data is above this value and 50% is below this value.
  • Third quartile (Q3), which is the 75th percentile
  • Sample maximum (the largest value)

In addition to these five numbers, outliers (very large or small values) are represented as dots.

Example and definition modified from Mikhail Popov
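Here is a rough sketch of how these numbers can be computed in base R. The sample `x` is made up for illustration. Note that `fivenum()` reports "hinges", which are close to (but not always identical to) the quartiles, and that R's default boxplot draws a point as an outlier when it lies more than 1.5 times the box length beyond the box.

```r
# Small made-up sample with one unusually large value (illustration only).
x <- c(21, 23, 24, 25, 27, 28, 30, 32, 35, 60)

five  <- fivenum(x)        # min, lower hinge, median, upper hinge, max
stats <- boxplot.stats(x)  # the numbers boxplot() uses for drawing

print(five)
print(stats$stats)  # whisker ends, hinges, and median
print(stats$out)    # values drawn as individual dots

boxplot(x, horizontal = TRUE, main = "Boxplot of a made-up sample")
```

In this sample the largest value (60) is drawn as a dot, so the upper whisker stops at 35, the largest value that is not an outlier.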

Understanding Boxplots

Try the different categorical variables out in the data. Is there a difference in medians between the categories? Is that difference meaningful?

Boxplots do lose some detail

Take a look at the boxplot and the violin plot side by side. What do you notice?

Relationships: 2 continuous variables

Scatterplots

Scatterplots show the relationship between two continuous variables. We can add what’s called a fit line between the two variables. A line that slopes clearly upward or downward suggests an association; a flat line suggests there is no linear association between the variables.

In this case, we are showing that knowing the values of one variable (Weight) means that you know something about another variable (BMI). Anyone know why?
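To get a feel for what a sloped fit line and a flat fit line look like, here is a minimal sketch in base R. All values are simulated (they are not NHANES measurements), and the variable names `related` and `unrelated` are made up for this example.

```r
set.seed(42)  # reproducible simulated data (not the NHANES values)
n <- 200
x         <- rnorm(n)
related   <- 2 * x + rnorm(n, sd = 0.5)  # built to depend on x
unrelated <- rnorm(n)                    # built independently of x

# Pearson correlation: near +1 or -1 means a strong linear
# relationship, near 0 means a weak one.
cor_related   <- cor(x, related)
cor_unrelated <- cor(x, unrelated)

# Straight-line fits; the slope is the coefficient named "x".
fit_related   <- lm(related ~ x)
fit_unrelated <- lm(unrelated ~ x)

old_par <- par(mfrow = c(1, 2))
plot(x, related, main = "Related pair")
abline(fit_related, col = "red")
plot(x, unrelated, main = "Unrelated pair")
abline(fit_unrelated, col = "red")
par(old_par)
```

The correlation and the fit line only describe straight-line relationships, so it is still worth looking at the points themselves.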

Look at the relationship between continuous variables

Now you can try to get a feel for what correlation (linear and non-linear) looks like. Try a few pairs, such as (BMI and Total Cholesterol), (BMI and Testosterone), and (Height and Income). Before you look at the data, make a guess whether there is a relationship between the variables.

Do you see a relationship between these two variables? How strong is the relationship?

Code adapted from DSIExplore by Jess Minnier

Relationships: 1 variable versus time

Time Series

Time series data is everywhere. We are often interested in looking at overall trends over time, or differences in trends over time.

Many types of clinical data can be represented as time series:

  • Lab Information
  • Patient Visits
  • Nursing Shifts

We might be interested in questions of volume (is the number of visits to the emergency room increasing over time?), or whether a treatment for a patient shows improvement or not (does taking Blood Pressure Medicine improve our blood pressure after two weeks?).

The data below are the counts of major categories of crimes by month/year from Portland Police. Before we visualize the data, here are the first few rows of the dataset:

## # A tibble: 6 x 3
## # Groups:   offense_category [1]
##   offense_category crime_yearmonth count
##   <fct>            <date>          <int>
## 1 Assault Offenses 2016-01-01        596
## 2 Assault Offenses 2016-02-01        487
## 3 Assault Offenses 2016-03-01        577
## 4 Assault Offenses 2016-04-01        590
## 5 Assault Offenses 2016-05-01        622
## 6 Assault Offenses 2016-06-01        626
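The six rows above are enough to sketch a simple static line chart in base R. The data frame name `assaults` is made up for this example, and only the rows printed above are used.

```r
# The six rows printed above, re-entered by hand.
assaults <- data.frame(
  crime_yearmonth = as.Date(c("2016-01-01", "2016-02-01", "2016-03-01",
                              "2016-04-01", "2016-05-01", "2016-06-01")),
  count = c(596L, 487L, 577L, 590L, 622L, 626L)
)

# Time on the horizontal axis, the measured quantity on the vertical axis.
plot(assaults$crime_yearmonth, assaults$count, type = "b",
     xlab = "Month", ylab = "Number of offenses",
     main = "Assault Offenses, Jan-Jun 2016")

# Month-to-month change, a simple numeric view of the trend.
monthly_change <- diff(assaults$count)
print(monthly_change)
```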

Line charts are the most common way to visualize time series data, with time usually as the horizontal X axis and range of a quantitative variable as the vertical Y axis. Try mousing along a line - you will get more information about each data point.

If the graph is too cluttered for you, try clicking on the legend to remove traces.

This is another plot called a stacked-area plot. This plot shows the cumulative numbers, and can be useful in showing how the total number of something is increasing or decreasing over time.

Finally, we can do a proportional stacked area plot. Much like the proportional barplot, we can see whether the proportion of any one category increases or changes over time.

Because there are multiple categories and the proportions are changing for each, it’s really hard to tell overall trends for this data.

Spatial Data

Spatial data is rapidly becoming very important, as it allows us to summarize spatial differences.

The following data is from Biketown about where people rented their bikes. Here’s what the data looks like as individual points. Note the location of each point (which corresponds to a bike rental) has two variables: Latitude and Longitude.

## # A tibble: 6 x 4
##   RouteID StartLatitude StartLongitude PaymentPlan
##     <dbl>         <dbl>          <dbl> <chr>      
## 1 4484904          45.6          -123. Casual     
## 2 1613456          45.5          -123. Casual     
## 3 4166261          45.5          -123. Subscriber 
## 4 5280514          45.5          -123. Subscriber 
## 5 4067734          45.5          -123. Subscriber 
## 6 5867044          45.5          -123. Casual

The data can be aggregated by neighborhood, if we have the spatial boundaries of each neighborhood. We can count the number of rentals within each boundary to summarize by neighborhood:

## # A tibble: 6 x 2
##   name                           num_rentals
##   <fct>                                <int>
## 1 CATHEDRAL PARK                           1
## 2 UNIVERSITY PARK                          1
## 3 PIEDMONT                                 6
## 4 WOODLAWN                                 7
## 5 CULLY ASSOCIATION OF NEIGHBORS           6
## 6 ARBOR LODGE                              5

Below is an interactive map of Portland and Biketown Rentals. The map can toggle between the points themselves and a choropleth, which summarizes the data by the number of points within a smaller area of the map, in this case, by neighborhood.

Which is easier to summarize by neighborhood? What neighborhood has the largest number of bike rentals?

Spatial data becomes extremely powerful when it is combined with other information. What if we asked questions about Asthma Incidence and Overall Levels of pollution?

In this example, location is clearly one Social Determinant of Health, and using spatial analysis to highlight inequities in populations is very exciting and may allow us to better address these health inequities.

Social Determinants of Health From https://upload.wikimedia.org/wikipedia/commons/thumb/1/16/Social_Determinants_of_Health_Infoviz.jpg/1068px-Social_Determinants_of_Health_Infoviz.jpg

Example: Smoking, Death, and Age

Smokers versus non-smokers

The following dataset represents a set of patients for whom we know the outcome (dead/alive), smoker status (yes/no), and Age.

Look at the plot very carefully before answering the question below:

The following is a plot of the outcome status versus the age of the patient in the dataset. Remember, you can click on the legend to remove traces on the graph.

Relationship between age and smoking/outcome

Let’s assess what happens to the number of deaths as we remove older patients from the data.

Adjust the slider and see what happens to the proportion of smokers for each outcome group as we remove the older patients from the dataset.

Quiz Yourself

Distributions of annual income

Take a look at the distributions of annual income graph here and compare 1960 to 2016.

https://flowingdata.com/2016/06/28/distributions-of-annual-income/

Quiz Yourself: Looking at Multiple Categories

The Bechdel Test is a test to determine whether women are represented fairly in movies.

The Bechdel Test, sometimes called the Mo Movie Measure or Bechdel Rule, is a simple test with the following three criteria: a movie (1) has to have at least two women in it, (2) who talk to each other, (3) about something besides a man.

The following is a dataset where the proportions of each of these categories are tabulated for movies over a range of time.

Bechdel Test in Movies

What about the subcategories, such as Women only talk about men? Is it easy or hard to see if they are changing with time?

Did your senatorial vote predict your presidential vote?

Take a look at the scatterplot below. Each dot represents a state and its percentage voting Republican or Democrat. Is there a relationship between the 2012 presidential results and the 2014 senatorial results?

Misleading Visualizations: Visualization Lies

Not all visualizations are good. Some are just plain misleading.

Take a look at the following page. What do you think is the most deceptive use of visualization?

https://flowingdata.com/2017/02/09/how-to-spot-visualization-lies/

Take Home Points

  • 2 categorical variables: barplots
  • 1 categorical variable, 1 continuous variable: boxplots
  • 2 continuous variables: scatterplots
  • 1 variable versus time: lineplot/stacked area plots
  • 1 variable versus location: marker map/choropleth
  • Need to understand what the graph is trying to communicate
  • Not all graphs are good - people can use them to mislead

Acknowledgements

Portions of this tutorial are partially adapted from:

Going beyond: Aesthetics

The following section is optional.

Aesthetic Properties: Color

Beyond our basic plots, we might want to map different measurements to different visual properties.

For the 2x2 bar plots, we mapped Diabetes to color.

For scatterplots, we might want to code categorical data as color like the example below:

Aesthetic Properties: Shape

Journals used to allow only black and white figures, so color couldn’t be used. One alternative is mapping our Diabetes variable to shape.

Waterfall Plots (optional)

Another way to visualize the relationship between a continuous variable and a categorical variable is a waterfall plot. With a waterfall plot, we plot each continuous value as a single bar and sort the bars by increasing value. Each bar is then colored by the categorical variable.

Each of these bars represents a person who smokes (Y/N) and their age. If you are older than 60, are you less likely to smoke?
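The recipe (sort, one bar per person, color by category) can be sketched in a few lines of base R. The ten people below are made up purely to show the mechanics; they are not taken from the tutorial's dataset and say nothing about real smoking patterns.

```r
# Made-up ages and smoking status for ten people (illustration only).
people <- data.frame(
  age    = c(34, 71, 45, 62, 29, 58, 66, 40, 75, 51),
  smoker = factor(c("Y", "N", "Y", "N", "Y", "N", "N", "Y", "N", "Y"))
)

# Sort by the continuous value so the bars rise from left to right.
sorted_people <- people[order(people$age), ]

# One bar per person, colored by the categorical variable.
palette_by_status <- c(N = "grey60", Y = "firebrick")
bar_colors <- palette_by_status[as.character(sorted_people$smoker)]

barplot(sorted_people$age, col = bar_colors, border = NA,
        xlab = "People, sorted by age", ylab = "Age",
        main = "Waterfall plot (made-up data)")
legend("topleft", legend = c("Non-smoker", "Smoker"),
       fill = c("grey60", "firebrick"), bty = "n")
```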

Graph literacy https://concord.org/wp-content/uploads/2016/12/pdf/teaching-graph-literacy-across-curriculum.pdf